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Non-linear state space models are a widely-used class of models for biological, economic, and 
physical processes. Fitting these models to observed data is a difficult inference problem that 
has no straightforward solution. We take a Bayesian approach to the inference of unknown 
parameters of a non-linear state model; this, in turn, requires the availability of efficient Markov 
Chain Monte Carlo (MCMC) sampling methods for the latent (hidden) variables and model 
parameters. Using the ensemble technique of Neal (2010) and the embedded HMM technique 
of Neal (2003), we introduce a new Markov Chain Monte Carlo method for non-linear state 
space models. The key idea is to perform parameter updates conditional on an enormously large 
ensemble of latent sequences, as opposed to a single sequence, as with existing methods. We 
look at the performance of this ensemble method when doing Bayesian inference in the Ricker 
model of population dynamics. We show that for this problem, the ensemble method is vastly 
more efficient than a simple Metropolis method, as well as 1.9 to 12.0 times more efficient than 
a single-sequence embedded HMM method, when all methods are tuned appropriately. We also 
introduce a way of speeding up the ensemble method by performing partial backward passes to 
discard poor proposals at low computational cost, resulting in a final efficiency gain of 3.4 to 20.4 
times over the single-sequence method. 



1 Introduction 

Consider an observed sequence zi, . . . , z^. In a state space model for Z ± , . . . , Z N , the distribution 
of the Zj's is defined using a latent (hidden) Markov process X\, . . . ,X^. We can describe such a 
model in terms of a distribution for the first hidden state, p(xi), transition probabilities between 
hidden states, p(xi\xi-i), and emission probabilities, p(zi\xi), with these distributions dependent 
on some unknown parameters 9. 
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While the state space model framework is very general, only two state space models, Hidden 
Markov Models (HMM's) and linear Gaussian models have efficient, exact inference algorithms. 
The forward-backward algorithm for HMM's and the Kalman filter for linear Gaussian models 
allow us to perform efficient inference of the latent process, which in turn allows us to perform 
efficient parameter inference, using an algorithm such as Expectation-Maximizaton for maximum 
likelihood inference, or various MCMC methods for Bayesian inference. No such exact and efficient 
algorithms exist for models with a continuous state space with non-linear state dynamics, non- 
Gaussian transition distributions, or non-Gaussian emission distributions, such as the Ricker 
model we consider later in this paper. 

In cases where we can write down a likelihood function for the model parameters conditional 
on latent and observed variables, it is possible to perform Bayesian inference for the parameters 
and the latent variables by making use of sampling methods such as MCMC. For example, one 
can perform inference for the latent variables and the parameters by alternately updating them 
according to their joint posterior. 

Sampling of the latent state sequence x — (x±, . . . , xn) is difficult for state space models when 
the state space process has strong dependencies — for example, when the transitions between 
states are nearly deterministic. To see why, suppose we sample from tt(xi, . . . , x^\zi, . . . , z^) 
using Gibbs sampling, which samples the latent variables one at a time, conditional on values of 
other latent variables, the observed data, and the model parameters. In the presence of strong 
dependencies within the state sequence, the conditional distribution of a latent variable will be 
highly concentrated, and we will only be able to change it slightly at each variable update, 
even when the marginal distribution of this latent variable, given zi, . . . ,zn, is relatively diffuse. 
Consequently, exploration of the space of latent sequences will be slow. 

The embedded HMM method of (Neal (2003), Neal, et al. (2004)) addresses this problem by 
updating the entire latent sequence at once. The idea is to temporarily reduce the state space 
of the model, which may be countably infinite or continuous, to a finite collection of randomly 
generated "pool" states at each time point. If the transitions between states are Markov, this 
reduced model is an HMM, for which we can use the forward-backward algorithm to efficiently 
sample a sequence with values in the pools at each time point. Pool states are chosen from a 
distribution that assigns positive probability to all possible state values, allowing us to explore the 
entire space of latent sequences in accordance with their exact distribution. Neal (2003) showed 
that when there are strong dependencies in the state sequence, the embedded HMM method 
performs better than conventional Metropolis methods at sampling latent state sequences. 

In our paper, we first look at an MCMC method which combines embedded HMM updates of 
the hidden state sequence with random-walk Metropolis updates of the parameters. We call this 
method the 'single-sequence' method. We next reformulate the embedded HMM method as an 
ensemble MCMC method. Ensemble MCMC allows multiple candidate points to be considered 
simultaneously when a proposal is made. This allows us to consider an extension of the embedded 
HMM method for inference of the model parameters when they are unknown. We refer to this 
extension as the ensemble embedded HMM method. We then introduce and describe a "staged" 
method, which makes ensemble MCMC more efficient by rejecting poor proposals at low com- 
putational cost after looking at a part of the observed sequence. We use the single-sequence, 
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ensemble, and ensemble with staging methods to perform Bayesian inference in the Ricker model 
of population dynamics, comparing the performance of these new methods to each other, and to 
a simple Metropolis sampler. 

2 Ensemble MCMC 

We first describe Ensemble MCMC, introduced by Neal (2010) as a general MCMC method, 
before describing its application to inference in state space models. 

Ensemble MCMC is based on the framework of MCMC using a temporary mapping. Suppose 
we want to sample from a distribution 7r on X. This can be done using a Markov chain with 
transition probablity from x to x' given by T(x'\x), for which it is an invariant distribution - 
that is, T must satisfy J 7r(x)T(x'\x)dx = tt(x'). The temporary mapping strategy defines T as 
a composition of three stochastic mappings. The current state x is stochastically mapped to a 
state y £ y using the mapping T(y\x). Here, the space y need not be the same as the space X. 
The state y is then updated to y' using the mapping T(y f \y), and finally a state x' is obtained 
using the mapping T(x'\y'). In this approach, we may choose whatever mappings we want, so 
long as the overall transition T leaves n invariant. In particular, if p is a density for y, T will 
leave 7r invariant if the following conditions hold. 

J n(x)f(y\x)dx = p{y) (1) 

J p{y)T{y'\y)dy = p{y') (2) 

J p(y')f(x'\y')dy' = n(x') (3) 

In the ensemble method, we take y = X K , with y = (x^\ . . . , x^) referred to as an "ensem- 
ble" , with K being the number of ensemble elements. The three mappings are then constructed 
as follows. Consider an "ensemble base measure" over ensembles (x^\ . . . , x^) with density 
((x^\. . . , x^), and with marginal densities (k(x^) for each of the k = \,...,K ensemble 
elements. We define T as 

f( X W,...,x^\x) = l^C- fc |,(^ ( - fc) k)4(x (fc) ) (4) 

k=l 

Here, 5 X is a distribution that places a point mass at x, x^^ is all of x^\ . . . ,x^ except x^- k \ 
and C-k\k(x^ kS) \x^) = ((x^\ . . . , x^) / ( k (x^) is the conditional density of all ensemble elements 
except the k-th, given the value x^ for the k-th. 

This mapping can be interpreted as follows. First, we select an integer k, from a uniform 
distribution on {1, . . . , K}. Then, we set the ensemble element x^ to x, the current state. 
Finally, we generate the remaining elements of the ensemble using the conditional density C-fe|fe- 
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The ensemble density p is determined by it and T, and is given explicitly as 

T can be any update (or sequence of updates) that leaves p invariant. For example, T could 
be a Metropolis update for y, with a proposal drawn from some symmetrical proposal density. 
Finally, T maps from y' to x' by randomly setting x' to x^-* with fc chosen from {1, . . . , K} with 
probabilities proportional to ir(x^)/(k(x^)- 

The mappings descibed above satisfy the necessary properties to make them a valid update, 
in the sense of preserving the stationary distribution n. The proof can be found in Neal (2010). 

3 Embedded HMM MCMC as an 
Ensemble MCMC method 

The embedded HMM method briefly described in the introduction was not initially introduced as 
an ensemble MCMC method, but it can be reformulated as one. We assume here that we are inter- 
ested in sampling from the posterior distribution of the state sequences, ir(xi, . . . , xn\zi, . . . , zn), 
when the parameters of the model are known. Suppose the current state sequence is x = 
(xi, . . . ,xn). We want to update this state sequence in a way that leaves n invariant. 

The first step of the embedded HMM method is to temporarily reduce the state space to a 
finite number of possible states at each time, turning our model into an HMM. This is done by, 
for each time i, generating a set of L "pool" states, Pj = {xf \ . . . , xf^}, as follows. We first set 
the pool state xf 1 to x^, the value of the current state sequence at time i. The remaining L — 1 
pool states xf\ for I > 1 are generated by sampling independently from some pool distribution 
with density /tj. The collections of pool states at different times are selected independently of 
each other. The total number of sequences we can then construct using these pool states, by 
choosing one state from the pool at each time, is K = L N . 

The second step of the embedded HMM method is to choose a state sequence composed of 
pool states, with the probability of such a state sequence, x, being proportional to 

N N 

q(x\z u . . . ,z N ) oc p(xi) JJpfolxi-i) J] 

i=2 i=l 

We can define ^(z^Xi) = p(zi\xi)/Ki(xi), and rewrite (6) as 

N N 

q(x\z 1: . . . ,z N ) oc p(x 1 ) JJp(xj|xj_i) JJlfeliEi) (7) 

i=2 i=l 



p(Zi\Xj) 
KiiXi) 



(6) 
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We now note that the distribution (7) takes the same form as the distribution over hidden state 
sequences for an HMM in which each ij e Pj — the initial state distribution is proportional 
to p(xi), the transition probabilities are proportional to p(xi\xi-i), and the ^(z^Xi) have the 
same role as emission probabilities. This allows us to use the well-known forward-backward 
algorithms for HMM's (reviewed by Scott (2002)) to efficiently sample hidden state sequences 
composed of pool states. To sample a sequence with the embedded HMM method, we first 
compute the "forward" probabilities. Then, using a stochastic "backwards" pass, we select a 
state sequence composed of pool states. (We can alternately compute backward probabilities and 
then do a stochastic forward pass). We emphasize that having an efficient algorithm to sample 
state sequences is crucial for the embedded HMM method. The number of possible sequences we 
can compose from the pool states, L N , can be very large, and so naive sampling methods would 
be impractical. 

In detail, for Xi G Pj, the forward probabilities a^Xj) are computed using a recursion that 
goes forward in time, starting from % — 1. We start by computing a,\{x\) = p(xi) r y(zi\xi). Then, 
for 1 < i < N, the forward probabilities a^x) are given by the recursion 

L 

ai{xi) = ^{z i \x i )y^ j p{x i \xf_ 1 )a i -i(xf_ 1 ), for x e Pi (8) 
i=i 

The stochastic backwards recursion samples a state sequence, one state at a time, beginning with 
the state at time N. First, we sample xn from the pool Pjv with probabilities proportional to 
®n(xn)- Then, going backwards in time for i from N — 1 to 1, we sample Xi from the pool Pj 
with probabilities proportional to p(xi + i\xi)ai(xi) , where Xi + \ is the variable just sampled at time 
% + 1. Both of these recursions are commonly implemented using logarithms of probabilities to 
avoid numerical underflow. 

Let us now reformulate the embedded HMM method as an ensemble MCMC method. The 
step of choosing the pool states can be thought of as performing a mapping T which takes a single 
hidden state sequence x and maps it to an ensemble of K state sequences y = (x^\ . . . ,x^), 
with x = x^ for some k chosen uniformly from {1, . . . , K}. (However, we note that in this 
context, the order in the ensemble does not matter.) 

Since the randomly chosen pool states are independent under /tj at each time, and across 
time as well, the density of an ensemble of hidden state sequences in the ensemble base measure, 
C, is defined through a product of )'s over the pool states and over time, and is non-zero 
for ensembles consisting of all sequences composed from the chosen set of pool states. The 
corresponding marginal density of a hidden state sequence x^ in the ensemble base measure is 

N 

a(* (fe) ) = n^ (fc) ) ( ) 

1=1 

Together, ( and Cfe define the conditional distribution (-k\k, which is used to define T. The 
mapping T is taken to be a null mapping that keeps the ensemble fixed at its current value, 
and the mapping T to a single state sequence is performed by selecting a sequence x^ from the 
ensemble with probabilities given by (7), in the same way as in the embedded HMM method. 
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4 The single-sequence embedded HMM MCMC method 



Let us assume that the parameters 9 of our model are unknown, and that we want to sample 
from the joint posterior distribution of state sequences x = (x±, . . . , xn) and parameters 9 = 
... , 9p), with density ir(x,9\zi, . . . , zn). One way of doing this is by alternating embedded 
HMM updates of the state sequence with Metropolis updates of the parameters. Doing updates in 
this manner makes use of an ensemble to sample state sequences more efficiently in the presence of 
strong dependencies. However, this method only takes into account a single hidden state sequence 
when updating the parameters. 

The update for the sequence is identical to that in the embedded HMM method, with initial, 
transition and emission densities dependent on the current value of 9. In our case, we only consider 
simple random-walk Metropolis updates for 9, updating all of the variables simultaneously. 

Evaluating the likelihood conditional on x and z, as needed for Metropolis parameter updates, 
is computationally inexpensive relative to updates of the state sequence, which take time pro- 
portional to L 2 . It may be beneficial to perform several Metropolis parameter updates for every 
update of the state sequence, since this will not greatly increase the overall computational cost, 
and allows us to obtain samples with lower autocorrelation time. 

5 An ensemble extension of the embedded HMM method 

When performing parameter updates, we can look at all possible state sequences composed of 
pool states by using an ensemble ((a^ 1 ),^), . . . , (x^ K \9)) that includes a parameter value 9, the 
same for each element of the ensemble. The update T could change both 9 and x^ k \ but in the 
method we will consider here, we only change 9. 

To see why updating 9 with an ensemble of sequences might be more efficient than updating 
9 given a single sequence, consider a Metropolis proposal in ensemble space, which proposes 
to change 9 for all of the ensemble elements, from 9 to 9*. Such an update can be accepted 
whenever there are some elements (x^ k \9*) in the proposed ensemble that make the ensemble 
density p((x^\ 9*), . . . , (x^ K \ 9*)) relatively large. That is, it is possible to accept a move in 
ensemble space with a high probability as long as some elements of the proposed ensemble, with 
the new 9*, lie in a region of high posterior density. This is at least as likely to happen as having 
a proposed 9* together with a single state sequence x lie in a region of high posterior density. 

Using ensembles makes sense when the ensemble density p can be computed in an efficient 
manner, in less than K times as long as it takes to compute p(x, 9\zi, . . . , zn) for a single hidden 
state sequence x. Otherwise, one could make K independent proposals to change x, which would 
have approximately the same computational cost as a single ensemble update, and likely be a 
more efficient way to sample x. In the application here, K = L N is enormous for typical values 
of N when L > 2, while computation of p takes time proportional only to NL 2 . 

In detail, to compute the ensemble density, we need to sum q(x^ k \9\zi, . . . ,zn) over all en- 
semble elements (x^ k \ 9), that is, over all hidden sequences which are composed of the pool states 
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at each time. The forward algorithm described above makes it possible to compute the ensemble 
density efficiently by summing the probabilities at the end of the forward recursion o:n(xn) over 
all x n G Pn- That is 

K L 
P((X^, 6),..., (*W 6)) OC 7r(0) ^ 0\Z1,...,Z N ) = 7r(0) (10) 

k=l 1=1 

where ir(6) is the prior density of 9. 

The ensemble extension of the embedded HMM method can be thought of as using an approx- 
imation to the marginal posterior of 9 when updating 9, since summing the posterior density over 
a large collection of hidden sequences approximates integrating over all such sequences. Larger 
updates of 9 may be possible with an ensemble because the marginal posterior of 9, given the 
data, is more diffuse than the conditional distribution of 9 given a fixed state sequence and the 
data. Note that since the ensemble of sequences is periodically changed, when new pool states 
are chosen, the ensemble method is a proper MCMC method that converges to the exact joint 
posterior, even though the set of sequences using pool states is restricted at each MCMC iteration. 

The ensemble method is more computationally expensive than the single-sequence method 
- it requires two forward passes to evaluate the ensemble density for two values of 9 and one 
backward pass to sample a hidden state sequence, whereas the single- sequence method requires 
only a single forward pass to compute the probabilities for every sequence in our ensemble, and 
a single backward pass to select a sequence. 

Some of this additional computational cost can be offset by reusing the same pool states to do 
multiple updates of 9. Once we have chosen a collection of pool states, and performed a forward 
pass to compute the ensemble density at the current value of 9, we can remember it. Proposing 
an update of 9 to 9* requires us to compute the ensemble density at 9* using a forward pass. 
Now, if this proposal is rejected, we can reuse the stored value of the ensemble density 9 when we 
make another proposal using the same collection of pool states. If this proposal is accepted, we 
can remember the ensemble density at the accepted value. Keeping the pool fixed, and saving the 
current value of the ensemble density therefore allows us to perform M ensemble updates with 
M + 1 forward passes, as opposed to 2M if we used a new pool for each update. 

With a large number of pool states, reusing the pool states for two or more updates has only a 
small impact on performance, since with any large collection of pool states we essentially integrate 
over the state space. However, pool states must still be updated occasionally, to ensure that the 
method samples from the exact joint posterior. 

6 Staged ensemble MCMC sampling 

Having to compute the ensemble density given the entire observed sequence for every proposal, 
even those that are obviously poor, is a source of inefficiency in the ensemble method. If poor 
proposals can be eliminated at a low computational cost, the ensemble method could be made 
more efficient. We could then afford to make our proposals larger, accepting occasional large 
proposals while rejecting others at little cost. 



7 



We propose to do this by performing "staged" updates. First, we choose a part of the observed 
sequence that we believe is representative of the whole sequence. Then, we propose to update 9 
to a 9* found using an ensemble update that only uses the part of the sequence we have chosen. If 
the proposal found by this "first stage" update is accepted, we perform a "second stage" ensemble 
update given the entire sequence, with 9* as the proposal. If the proposal at the first stage is 
rejected, we do not perform a second stage update, and add the current value of 9 to our sequence 
of sampled values. This can be viewed as a second stage update where the proposal is the current 
state — to do such an update, no computations need be performed. 

Suppose that p\ is the ensemble density given the chosen part of the observed sequence, 
and q{9*\9) is the proposal density for constructing the first stage update. Then the acceptance 
probability for the first stage update is given by 

• A Pl ((x^,9*),...,(x( K \9*)) q (9\9*) \ 

If p is the ensemble density given the entire sequence, the acceptance probability for the second 
stage update is given by 

/ o((xW 9*) ( X W 9*))mm(l 
min 1, ^ ^) (12) 

V n((xW 9) (x( R ) 9)) min II eiM^r^i^lMtm 
p{{xy [xy u)) mini i, pi{{x (i) fi , )y _ Xx (K) 

Regardless of whether 9*), . . . , {x^ K \ 9*))q{9\9*) < 9), . . . , {x^ K \ 9))q{9* \9) or 

vice versa, the above ratio simplifies to 

. ( p{{x {1 \ n • • • , (s (J °, 9*)) Pl ((xW,9*), . . . , (gg, 9*))q(9\9*) \ 
mm ^' p((xW,9),...,(x^,9)) Pl ((xW,9),...,(xW,9))q(9*\9) ) 1 ] 

Choosing a part of the observed sequence for the first stage update can be aided by looking at the 
acceptance rates at the first and second stages. We need the moves accepted at the first stage to 
also be accepted at a sufficiently high rate at the second stage, but we want the acceptance rate at 
the first stage to be low so the method will have an advantage over the ensemble method in terms 
of computational efficiency. We can also look at the 'false negatives' for diagnostic purposes, that 
is, how many proposals rejected at the first step would have been accepted had we looked at the 
entire sequence when deciding whether to accept. 

We are free to select any portion of the observed sequence to use for the first stage. We will 
look here at using a partial sequence for the first stage consisting of observations starting at n\ 
and going until the end, at time N. This is appropriate for our example later in the paper, where 
we only have observations past a certain point in time. 

For this scenario, to perform a first stage update, we need to perform a backward pass. As 
we perform a backward pass to do the first stage proposal, we save the vector of "backward" 
probabilities. Then, if the first stage update is accepted, we start the recursion for the full 
sequence using these saved backward probabilities, and compute the ensemble density given the 
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entire sequence, avoiding recomputation of backward probabilities for the portion of the sequence 
used for the first stage. 

To compute the backward probabilities ^(xj), we perform a backward recursion, starting at 
time N. We first set f3 N (x N ) to 1 for all x N G Pn- We then compute, for n\ < i < N 

L 

= ^P{xf +l \xi)(5 i+l {xf +l )^(z i+l \xf +1 ) (14) 
1=1 

We compute the first stage ensemble density p 1 as follows 

Pl ((x^,e), . . . , (/>fl)) = tt(60 j^p{xl)p{y n Axl)p ni {xl) (15) 

1=1 

We do not know p(x ni ), but we can choose a substitute for it (which affects only performance, 
not correctness). One possibility is a uniform distrubution over the pool states at rii, which is 
what we will use in our example below. 

The ensemble density for the full sequence can be obtained by performing the backward 
recursion up to the beginning of the sequence, and then computing 

L 

p((x^\ 6),..., (xW, 6)) = tt(0) ^TprfWl^irf 1 ) (16) 

i=i 



To see how much computation time is saved with staged updates, we measure computation 
time in terms of the time it takes to do a backward (or equivalently, forward) pass — generally, 
the most computationally expensive operation in ensemble MCMC — counting a backward pass 
to time rt\ as a partial backward pass. 

Let us suppose that the acceptance rate for stage 1 updates is a\. An ensemble update that 
uses the full sequence requires us to perform two backwards passes if we update the pool at every 
iteration, whereas a staged ensemble update will require us to perform 

N-m ni - 1 

1 + W^i +ai W^i (17) 

backward passes on average (counting a pass back only to n\ as (N — ni)/(N — 1) passes). The 
first term in the above formula represents the full backward pass for the initial value of that is 
always needed — either for a second stage update, or for mapping to a new set of pool states. 
The second term represents the partial backward pass we need to perform to complete the first 
stage proposal. The third term accounts for having to compute the remainder of a backwards 
pass if we accept the first stage proposal — hence it is weighted by the first stage acceptance 
rate. 

We can again save computation time by updating the pool less frequently. Suppose we decide 
to update the pool every M iterations. Without staging, the ensemble method would require a 
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total of M + 1 forward (or equivalently, backward) passes. With staged updates, the expected 
number of backward passes we would need to perform is 

N — ri] rii — 1 / % 

1 + + (18) 

as can be seen by generalizing the expression above to M > 1. 



7 Constructing pools of states 

An important aspect of these embedded HMM methods is the selection of pool states at each 
time, which determines the mapping T from a single state sequence to an ensemble. In this paper, 
we will only consider sampling pool states independently from some distribution with density «j 
(though dependent pool states are considered in (Neal, 2003). 

One choice of Ki is the conditional density of given z iy based on some pseudo-prior distri- 
bution r] for Xi — that is, /Cj(xj) oc r)(xi)p(zi\xi). This distribution approximates, to some extent, 
the marginal posterior of Xi given all observations. We hope that such a pool distribution will 
produce sequences in our ensemble which have a high posterior density, and as a result make 
sampling more efficient. 

To motivate how we might go about choosing rj, consider the following. Suppose we sample 
values of 9 from the model prior, and then sample hidden state sequences, given the sampled 
values of 9. We can then think of r\ as a distribution that is consistent with this distribution of 
hidden state sequences, which is in turn consistent with the model prior. In this paper, we choose 
tj heuristically, setting it to what we believe to be reasonable, and not in violation of the model 
prior. 

Note that the choice of rj affects only sampling efficiency, not correctness (as long as it is 
non-zero for all possible Xj). However, when we choose pool states in the ensemble method, we 
cannot use current values of 9, since this would make an ensemble update non-reversible. This 
restriction does not apply to the single-sequence method since in this case T is null. 



8 Performance comparisons on a 
population dynamics model 

We test the performance of the proposed methods on the Ricker population dynamics model, 
described by Wood (2003). This model assumes that a population of size iVj (modeled as a 
real number) evolves as N i+ i = rA^exp(— JVj + e«), with independent with a Normal(0, a 2 ) 
distribution, with Nq = 1. We don't observe this process directly, but rather we observe Y^s 
whose distribution is Poisson(0iVj). The goal is to infer 9 = (r,a,(f>). This is considered to be a 
fairly complex inference scenario, as evidenced by the application of recently developed inference 
methods such as Approximate Bayesian Computation (ABC) to this model. (See Fearnhead, 
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Prangle (2012) for more on the ABC approach.) This model can be viewed as a non-linear state 
space model, with TV, as our state variable. 

MCMC inference in this model can be inefficient for two reasons. First, when the value of a 2 
in the current MCMC iteration is small, consecutive N^s are highly dependent, so the distribution 
of each iVj, conditional on the remaining iVj's and the data, is highly concentrated, making it hard 
to efficiently sample state sequences one state at a time. An MCMC method based on embedding 
an HMM into the state space, either the single-sequence method or the ensemble method, can 
potentially make state sequence sampling more efficient, by sampling whole sequences at once. 
The second reason is that the distribution of 9, given a single sequence of N^s and the data, can 
be concentrated as well, so we may not be able to efficiently explore the posterior distribution of 9 
by alternately sampling 9 and the iVj's. By considering an ensemble of sequences, we may be able 
to propose and accept larger changes to 9. This is because the posterior distribution of 9 summed 
over an ensemble of state sequences is less concentrated than it is given a single sequence. 

To test our MCMC methods, we consider a scenario similar to those considered by Wood 
(2010) and Fearnhead and Prangle (2012). The parameters of the Ricker model we use are 
r = exp(3.8),<7 = 0.15,0 = 2. We generated 100 points from the Ricker model, with only 
observed from time 51 on, mimicking a situation where we do not observe a population right from 
its inception. 

We put a Uniform(0, 10) prior on log(r), a Uniform(0, 100) prior on <j), and a Uniform[log(0.1), 0] 
prior on log(cr). Instead of N iy we use Mj = log(0iVj) as our state variables, since we found that 
doing this makes MCMC sampling more efficient. With these state variables, our model can be 
written as 

Mi ~ N(\og(r) + log(0) - 1, a 2 ) (19) 
M^Mi.i ~ JV(log(r) + Mi_i - exp(M i _i)/0, a 2 ), i = 2, . . . , 100 (20) 
Yi\Mi ~ Poisson(exp(M i )), i = 51, . . . , 100 (21) 

Furthermore, the MCMC state uses the logarithms of the parameters, (log(r), log(a), log(0)). 

For parameter updates in the MCMC methods compared below, we used independent normal 
proposals for each parameter, centered at the current parameter values, proposing updates to 
all parameters at once. To choose appropriate proposal standard deviations, we did a number 
of runs of the single sequence and the ensemble methods, and used these trial runs to estimate 
the marginal standard deviations of the logarithm of each parameter. We got standard deviation 
estimates of 0.14 for log(r), 0.36 for log(a), and 0.065 for log(0). We then scaled each estimated 
marginal standard deviation by the same factor, and used the scaled estimates as the proposal 
standard deviations for the corresponding parameters. The maximum scaling we used was 2, 
since beyond this our proposals would often lie outside the high probability region of the marginal 
posterior density. 

We first tried a simple Metropolis sampler on this problem. This sampler updates the latent 
states one at a time, using Metropolis updates to sample from the full conditional density of each 



11 



state M t , given by 



p{mi\y 5 i, ■■■ ,2/ioo, m-i) oc p(mi)p(m 2 |mi) (22) 

p(mi|y 5 i, ... ,yioo, oc p(m i |m i _i)p(m i+ i|m i ), 2 < z < 50 (23) 

p(mi|y 5 i, ••• ,2/ioo, oc p(mi|mi_i)p(2/i|mi)p(mi + i|mi), 51 < i < 99 (24) 

P(m 100 \y 51 , . . . , ?/ioo, m_ioo) oc M m ioo|™99M2/ioo|™ioo) (25) 

We started the Metropolis sampler with parameters set to prior means, and the hidden states to 
randomly chosen values from the pool distributions we used for the embedded HMM methods 
below. After we perform a pass over the latent variables, and update each one in turn, we perform 
a Metropolis update of the parameters, using a scaling of 0.25 for the proposal density. 

The latent states are updated sequentially, starting from 1 and going up to 100. When 
updating each latent variable Mi, we use a Normal proposal distribution centered at the current 
value of the latent variable, with the following proposal standard deviations. When we do not 
observe j/i, or j/i = 0, we use the current value of a from the state times 0.5. When we observe 



2/j > 0, we use a proposal standard deviation of 1/yl/ff 2 + 2/i (with a from the state). This choice 
can be motivated as follows. An estimate of precision for Mj given M*_i is 1/cr 2 . Furthermore, 
Yi is Poisson(0iVj), so that Var(0A^) « y { and Var(log(0iV i )) = Var(Mj) « l/yi- So an estimate 
for the precision of Mi given 2/i is 2/i- We combine these estimates of precisions to get a proposal 
standard deviation for the latent variables in the case when 2/i > 0. 

We ran the Metropolis sampler for 6, 000, 000 iterations from five different starting points. 
The acceptance rate for parameter updates was between about 10% and 20%, depending on the 
initial starting value. The acceptance rate for latent variable updates was between 11% and 84%, 
depending on the run and the particular latent variable being sampled. 

We found that the simple Metropolis sampler does not perform well on this problem. This 
is most evident for sampling a. The Metropolis sampler appears to explore different regions of 
the posterior for a when it is started from different initial hidden state sequences. That is, the 
Metropolis sampler can get stuck in various regions of the posterior for extended periods of time. 
An example of the behaviour of the Metropolis sampler be seen on Figure [TJ The autocorrelations 
for the parameters are so high that accurately estimating them would require a much longer run. 

This suggests that more sophisticated MCMC methods are necessary for this problem. We 
next looked at the single-sequence method, the ensemble method, and the staged ensemble 
method. 

All embedded MCMC-based samplers require us to choose pool states for each time i. For 
the i's where no 2/i's are observed, we choose our pool states by sampling values of exp(Mj) from 
a pseudo-prior r\ for the Poisson mean exp(Mj) — a Gamma(fc, 9) distribution with k = 0.15 and 
9 = 50 — and then taking logarithms of the sampled values. For the i's where we observe 2/i's we 
choose our pool states by sampling values of exp(Mj) from the conditional density of exp(Mj)|?/j 
with rj as the pseudo-prior. Since li|Mj is Poisson(exp(Mj)), the conditional density of exp(Mi)\y i 
has a Gamma(A; + 2/i, 9/ (1 + 9)) distribution. 

It should be noted that since we choose our pool states by sampling exp(Mj)'s, but our model 
is written in terms of Mj, the pool density Ki must take this into account. In particular, when 2/i 
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Figure 1: An example run of the simple Metropolis method. 
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Figure 2: An example ensemble method run, with 120 pool states and proposal scaling 1.4. 
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is unobserved, we have 

Kj(mj) = r ^fl fc exp(A;mj - exp(mj)/0), -oo < m, < oo (26) 

and when yi is observed we replace k with k + yi and 6 1 with 9/(1 + 9). We use the same way of 
choosing the pool states for all three methods we consider. 

The staged ensemble MCMC method also requires us to choose a portion of the sequence to 
use for the first stage update. On the basis of several trial runs, we chose to use the last 20 points, 
i.e. rii = 81. 

As we mentioned earlier, when likelihood evaluations are computationally inexpensive relative 
to sequence updates, we can do multiple parameter updates for each update of the hidden state 
sequence, without incurring a large increase in computation time. For the single-sequence method, 
we do ten Metropolis updates of the parameters for each update of the hidden state sequence. For 
the ensemble method, we do five updates of the parameters for each update of the ensemble. For 
staged ensemble MCMC, we do ten parameter updates for each update of the ensemble. These 
choices were based on numerous trial runs. 

We thin each of the single-sequence runs by a factor of ten when computing autocorrelation 
times — hence the time per iteration for the single-sequence method is the time it takes to do a 
single update of the hidden sequence and ten Metropolis updates. We do not thin the ensemble 
and staged ensemble runs. 

To compare the performance of the embedded HMM methods, and to tune each method, we 
looked at the autocorrelation time, r, of the sequence of parameter samples, for each parameter. 
The autocorrelation time can be thought of as the number of samples we need to draw using our 
Markov chain to get the equivalent of one independent point (Neal (1993)). It is defined as 

oo 

r = l + 2^p fe (27) 

k=l 

where pk is the autocorrelation at lag k for a function of state that is of interest. Here, pk = 
lk/lo, where % is the estimated autocovariance at lag k. We estimate each 7 fe by estimating 
autocovariances using each of the five runs, using the overall mean from the five runs, and then 
averaging the resulting autocovariance estimates. We then estimate r by 

K 

f = l + 2^p fc (28) 

k=l 

Here, the truncation point K is where the remaining pVs are nearly zero. 

For our comparisons to have practical validity, we need to multiply each estimate of autocor- 
relation time estimate by the time it takes to perform a single iteration. A method that produces 
samples with a lower autocorrelation time is often more computationally expensive than a method 
that produces samples with a higher autocorrelation time, and if the difference in computation 
time is sufficiently large, the computationally cheaper method might be more efficient. Compu- 
tation times per iteration were obtained with a program written in MATLAB on a Linux system 
with an Intel Xeon X5680 3.33 GHz CPU. 
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0.09 
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213 
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400,000 


0.17 
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58 
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0.61 
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134 
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82 
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40 


0.6 


13% 


100,000 


0.23 


496 


335 


897 


114 


77 


206 




60 


1 


11% 


100,000 


0.34 


187 


115 


167 


64 


39 


57 


Ensemble 


80 


1 


16% 


100,000 


0.47 


107 


55 


90 


50 


26 


42 




120 


1.4 


14% 


100,000 


0.76 


52 


41 


45 


40 


31 


34 




180 


1.8 


12% 


100,000 


1.26 


35 


27 


29 


44 


34 


37 




40 


1 


30, 15% 


100,000 


0.10 


692 


689 


1201 


69 


69 


120 




60 


1.4 


27, 18% 


100,000 


0.14 


291 


373 


303 


41 


52 


42 


Staged 


80 


1.4 


30, 25% 


100,000 


0.21 


187 


104 


195 


39 


22 


41 


Ensemble 


120 


1.8 


25, 29% 


100,000 


0.29 


75 


59 


70 


22 


17 


20 




180 


2 


23, 32% 


100,000 


0.45 


48 


44 


52 


22 


20 


23 



Table 1: Comparison of autocorrelation times. 



For each number of pool states considered, we started the samplers from five different initial 
states. Like with the Metropolis method, the parameters were initialized to prior means, and the 
hidden sequence was initialized to states randomly chosen from pool distribution at each time. 
When estimating autocorrelations, we discarded the first 10% of samples of each run as burn-in. 

An example ensemble run is shown in Figure [2} Comparing with Figure [TJ one can see that the 
ensemble run has an enormously lower autocorrelation time. Autocorrelation time estimates for 
the various ensemble methods, along with computation times, proposal scaling, and acceptance 
rates, are presented in Table [TJ For the staged ensemble method, the acceptance rates are shown 
for the first stage and second stage. We also estimated the parameters of the model by averaging 
estimates of the posterior means from each of the five runs for the single-sequence method with 
40 pool states, the ensemble method with 120 pool states, and the staged ensemble method with 
120 pool states. The results are presented in Table [2j The estimates using the three different 
methods do not differ significantly. 



Method 


r 




a 


<t> 


Single-Sequence 

Ensemble 
Staged Ensemble 


44.46 (± 0.09) 
44.65 (± 0.09) 
44.57 (± 0.04) 


0.2074 
0.2089 
0.2089 


(± 0.0012) 
(± 0.0009) 
(± 0.0010) 


1.9921 (± 0.0032) 
1.9853 (± 0.0013) 
1.9878 (± 0.0015) 



Table 2: Estimates of posterior means, with standard errors of posterior means shown in brackets. 
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From these results, one can see that for our Ricker model example, the single-sequence method 
is less efficient than the ensemble method, when both methods are well-tuned and computation 
time is taken into account. We also found that the the staged ensemble method allows us to 
further improve performance of the ensemble method. In detail, depending on the parameter one 
looks at, the best tuned ensemble method without staging (with 120 pool states) is between 1.9 
and 12.0 times better than the best tuned single-sequence method (with 40 pool states). The 
best tuned ensemble method with staging (120 pool states) is between 3.4 and 20.4 times better 
than the best single-squence method. 

The large drop in autocorrelation time for a for the single-sequence method between 20 and 
40 pool states is due to poor mixing in one of the five runs. To confirm whether this systematic, 
we did five more runs of the single sequence method, from another set of starting values, and 
found that the same problem is again present in one of the five runs. This is inidicative of a risk 
of poor mixing when using the single-sequence sampler with a small number of pool states. We 
did not observe similar problems for larger numbers of pool states. 

The results in Table [T] show that performance improvement is greatest for the parameter (p. 
One reason for this may be that the posterior distribution of <fr given a sequence is significantly 
more concentrated than the marginal posterior distribution of </>. Since for a sufficiently large 
number of pool states, the posterior distribution given an ensemble approximates the marginal 
posterior distribution, the posterior distribution of given an ensemble will become relatively 
more diffuse than the posterior distributions of r and a. This leads to a larger relative performance 
improvement when sampling values of 0. 

Evidence of this can be seen on Figure [3] To produce it, we took the hidden state sequence and 
parameter values from the end of one of our ensemble runs, and performed Metropolis updates 
for the parameters, while keeping the hidden state sequence fixed. We also took the same hidden 
state sequence and parameter values, mapped the hidden sequence to an ensemble of sequences 
by generating a collection of pool states (we used 120) and peformed Metropolis updates of the 
parameter part of the ensemble, keeping the pool states fixed. We drew 50, 000 samples given a 
single fixed sequence and 2,000 samples given an ensemble. 




r r sigma 



Figure 3: An illustration to aid explanation of relative performance. Note that the fixed sequence 
dots are smaller, to better illustrate the difference in the spread between the fixed sequence and 
two other distributions. 
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Visually, we can see that the posterior of 9 given a single sequence is significantly more 
concentrated than the marginal posterior, and the posterior given a fixed ensemble of sequences. 
Comparing the standard deviations of the posterior of 9 given a fixed sequence, and the marginal 
posterior of 9, we find that the marginal posterior of has a standard deviation about 21 times 
larger than the posterior of given a single sequence. The marginal posteriors for r and a have 
a posterior standard deviation larger by a factor of 5.2 and 6.0. The standard deviation of the 
posterior given our fixed ensemble of sequences is greater for by a factor of 11, and by factors 
of 4.3 and 3.2 for r and a. This is consisent with our explanation above. 

We note that the actual timings of the runs are different from what one may expect. In 
theory, the computation time should scale as nL 2 , where L is the number of pool states and n 
is the number of observations. However, the implementation we use, in MATLAB, implements 
the forward pass as a nested loop over n and over L, with another inner loop over L vectorized. 
MATLAB is an interpreted language with slow loops, and vectorizing loops generally leads to 
vast performance improvements. For the numbers of pool states we use, the computational cost 
of the vector operation corresponding to the inner loop over L is very low compared to that of the 
outer loop over L. As a result, the total computational cost scales approximately linearly with 
L, in the range of values of L we considered. An implementation in a different language might 
lead to different optimal pool state settings. 

The original examples of Wood and Fearnhead and Prangle used = 10 instead of = 2. 
We found that for = 10, the ensemble method still performs better than the single sequence 
method, when computation time is taken into account, though the difference in performance is 
not as large. When is larger, the observations yi are larger on average as well. As a result, 
the data is more informative about the values of the hidden state variables, and the marginal 
posteriors of the model parameters are more concentrated. As a result of this, though we would 
still expect the ensemble method to improve performance, we would not expect the improvement 
to be as large as when = 2. 

Finally, we would like to note that if the single-sequence method is implemented already, im- 
plementing the ensemble method is very simple, since all that is needed is an additional call of the 
forward pass function and summing of the final forward probabilities. So, the performance gains 
from using an ensemble for parameter updates can be obtained with little additional programming 
effort. 



9 Conclusion 

We found that both the embedded HMM MCMC method and its ensemble extension perform 
significantly better than the ordinary Metropolis method for doing Bayesian inference in the 
Ricker model. This suggests that it would be promising to investigate other state space models 
with non-linear state dynamics, and see if it is possible to use the embedded HMM methods we 
described to perform inference in these models. 

Our results also show that using staged proposals further improves the performance of the 
ensemble method. It would be worthwhile to look at other scenarios where this technique might 
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be applied. 



Most importantly, however, our results suggest that looking at multiple hidden state sequences 
at once can make parameter sampling in state space models noticeably more efficient, and so 
indicate a direction for further research in the development of MCMC methods for non-linear, 
non-Gaussian state space models. 
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